Multi-omics signatures of the human early life exposome

Environmental exposures during early life play a critical role in life-course health, yet the molecular phenotypes underlying environmental effects on health are poorly understood. In the Human Early Life Exposome (HELIX) project, a multi-centre cohort of 1301 mother-child pairs, we associate individual exposomes consisting of >100 chemical, outdoor, social and lifestyle exposures assessed in pregnancy and childhood, with multi-omics profiles (methylome, transcriptome, proteins and metabolites) in childhood. We identify 1170 associations, 249 in pregnancy and 921 in childhood, which reveal potential biological responses and sources of exposure. Pregnancy exposures, including maternal smoking, cadmium and molybdenum, are predominantly associated with child DNA methylation changes. In contrast, childhood exposures are associated with features across all omics layers, most frequently the serum metabolome, revealing signatures for diet, toxic chemical compounds, essential trace elements, and weather conditions, among others. Our comprehensive and unique resource of all associations (https://helixomics.isglobal.org/) will serve to guide future investigation into the biological imprints of the early life exposome.

3 sun exposure, and puberty development of the child. Concentrations of drinking water disinfection by products (DBPs) during pregnancy were estimated from water company concentration and habits obtained from questionnaire data. Data was not sufficiently complete to estimate child exposure to DBPs. Indoor air concentrations of nitrogen dioxide (NO2), particulate matter <2.5μm (PM2.5), particulate matter absorbance (PMabs), benzene, and toluene, ethylbenzene, xylene (BTEX) were estimated through a prediction model that combined measurements in the homes of a subgroup of children with questionnaire data from the subcohort. Measurements of indoor NO2, benzene and TEX were conducted in the homes of 157 participants as part of the child panel study, which was nested within the HELIX subcohort in all cohorts except MoBa.
Participants in the child panel study were followed for one week in two periods (approximately 6 months apart), and the last day of the first week coincided with the subcohort examination, including the completion of the main HELIX questionnaire.
Exposures were either continuous variables or categorical variables with two or more levels.
Continuous exposure variables were transformed to achieve linearity or categorized, when needed. Missing data were imputed using a chained equations method 12 implemented in the mice v3.4.0 R package 13 . Twenty imputed datasets were created, although we only used the first imputation in this study. The correlation among exposures, within each exposure window or overall has been described in detail elsewhere 14

Child molecular phenotypes
Several child molecular phenotypes were measured as described in Supplementary Data S1E, and described below.

Blood DNA methylation
DNA was obtained from children's peripheral blood (buffy coat) collected in EDTA tubes. DNA was extracted using the Chemagen kit (Perkin Elmer, USA) in batches of 12 samples within each cohort. DNA concentration was determined in a Nanodrop 1000 UV-Vis Spectrophotometer (Thermo Fisher Scientific, USA) and also with Quant-iT TM PicoGreen  dsDNA Assay Kit (Life Technologies, USA). DNA extraction was repeated in around 8% of the blood samples as the DNA quantity or quality of the first extraction was low. Less than 1.5% of the samples were finally 4 excluded due to low quality.
DNA methylation was assessed with the Infinium HumanMethylatio450 beadchip (Illumina, USA) at the University of Santiago de Compostela -Spanish National Genotyping Center (CeGen-USC) (Spain). 700 ng of DNA were bisulfite-converted using the EZ 96-DNA kit (Zymo Research, USA) following the manufacturer's standard protocol. All samples of the study were randomized considering sex and cohort. In addition, each plate contained a HapMap control sample and 24 HELIX inter-plate duplicates were included.
After an initial inspection of the quality of the methylation data with the MethylAid v1.8.0 R package 15 , probes with a call rate <95% based on a detection p-value of 1E-16 and samples with a call rate <98% were removed 16 . Samples with discordant sex were eliminated from the study as well as duplicates with inconsistent genotypes and samples with inconsistent genotypes respect to existing genome-wide genotyping array data. Methylation data was normalized using the functional normalization method with prior background correction with Noob 17 . Then, some probes were filtered out: control probes, probes to detect single nucleotide polymorphisms (SNPs), probes to detect methylation in non-CpG sites, probes located in sexual chromosomes, cross hybridizing probes 18 , probes containing a SNP at any position of the sequence with a minor allele frequency (MAF) >5% and probes with a SNP at the CpG site or at the single base extension (SBE) at any MAF in the combined population from 1000 Genomes Project. Batch effects and blood cell composition were corrected by calculating 134 surrogate variables while protecting cohort, sex and age with the surrogate variable analysis (SVA) method 19  Normalization using the selected miRNA reference set was done by applying the variance stabilization and calibration for microarray data (vsn) method 25 . Normalized miRNA levels were log2 transformed and annotated using a combination of information from Agilent annotation ("Annotation_7056") and miRbase v21 (GRCh38 and mapped back to hg19) released in January 2017 ("annotation_miRBase_GRCh38_coordinates-gff3"). Then control probes, miRNAs in sexual chromosomes and unannotated miRNAs were excluded from the database. Batch effects and blood cell composition was corrected as described above with a total of 35 surrogate variables. Finally, miRNAs with a call rate <1% were also filtered out. A miRNA was considered as detected if its expression was different from the background or the standard error of its different probes was smaller than 3 times the expression signal. After exclusion of control samples and control probes, the average number of detected miRNAs was 420.43 (minimum: 71; maximum: 940).

Plasma proteins
Plasma protein levels were assessed using the antibody-based multiplexed platform from Raw intensities obtained with the xMAP and Luminex system for each plasma sample were converted to pg/ml using the calculated standard curves of each plate and accounting for the dilutions made prior measurement. The percentages of coefficients of variation (CV%) for each protein by plate ranged from 3% to 36%. The limit of detection (LOD) and the lower and upper limit of quantification (LOQ1 and LOQ2, respectively) were estimated by plate, and then averaged.
Only proteins with >30% of measurements in the linear range of quantification were kept in the database and the others were removed. Seven proteins were measured twice (in two different multiplex kits). We kept the measure with higher quality. The 36 proteins that passed the quality control criteria mentioned above were log2 transformed (Supplementary Data 1F) 26 . Then, the plate batch effect was corrected by subtracting the plate specific average for each protein minus the overall average of all plates for that protein. After that, values below the LOQ1 and above the LOQ2 were imputed using a truncated normal distribution implemented in the truncdist R v1.0-2 package 27 . Twenty samples were excluded due to having ten or more proteins out of the linear range of quantification.

Serum metabolites
The AbsoluteIDQTM p180 kit was chosen for serum analysis as it is a standardised, targeted LC-MS/MS assay, widely used for large-scale epidemiology studies and its inter-laboratory reproducibility has been demonstrated by several independent laboratories 28  We also excluded one HELIX sample, which was hemolyzed.
Concentration levels were log2 transformed.

Urinary metabolites
Two urine samples, representing last night-time and first morning voids, were collected on the evening and morning before the clinical examination, kept in a fridge and transported in a temperature-controlled environment, and aliquoted and frozen within 3 h of arrival at the clinics.
They were subsequently pooled to generate a more representative sample of the last 24 h for 9 metabolomic analysis.
Urinary metabolic profiles were acquired using 1 H NMR spectroscopy according to 29 . In brief onedimensional 600 MHz 1 H NMR spectra of urine samples from each cohort were acquired on the same Bruker Avance III spectrometer operating at 14.1 Tesla within a period of 1 month. The spectrometer was equipped with a Bruker SampleJet system, and a 5-mm broad-band inverse configuration probe maintained at 300K. Prior to analysis, cohort samples were randomised.  29 . The urinary NMR showed excellent analytical performance, the mean coefficient of variation across the 44 NMR detected urinary metabolites was 11%. Data was normalized using the median fold change normalization method 30 , which takes into account the distribution of relative levels of all 44 metabolites compared to the reference sample in determining the most probable dilution factor. An offset of ½ of the minimal value was applied and then concentration levels were expressed as log2.

Statistical analysis (Exposome-omics-Wide Association Study -ExWAS)
To test the relationship between the pregnancy and childhood exposomes and molecular features, we fitted linear regressions between each exposure variable and each molecular feature adjusting for covariates, using the limma v3.46.0 R package 24  Main covariates for all omics were: cohort, child's sex, child's age, child sex and age z-score BMI calculated according to WHO reference curves 33,34 , child's ethnicity defined in three categories (European ancestry; Pakistani or Asian; and other), and self-reported maternal education categorized in low (primary school), medium (secondary school) and high (university degree or higher). Ethnic origin (African, Asian, white European, Mexican, and other) was asked to the families as part of the spirometry protocol. This information was combined with existing data (parent's ethnic origin and/or parent's country of birth) in each of the cohorts. Missing data for covariates was imputed as described above for the exposome.
In addition, models for each omics were adjusted for specific covariates: (i) plasma protein models were adjusted for time to last meal and hour of blood collection, (ii) serum metabolite models for time to last meal, hour of blood collection, and technical batch, (iii) urinary metabolite models for sample type (bedtime, morning or pool), and technical batch. Blood methylation and transcriptomics data were corrected by surrogate variables as described above, which captured both batch effects and blood cell type composition.
In all omics, except for methylation, the effect size is reported as a log2 fold change (log2FC) of the molecular phenotype levels between categories of discrete exposure variables or for

Sensitivity analyses
A set of sensitivity analyses were conducted. First, models were run again without adjustment for child z-BMI. The difference in the effect size among main models and z-BMI unadjusted models was calculated as (effect size main modeleffect size alternative model) / effect size main model *100. Second, analyses were restricted to children of European ancestry (90%). Third, top hit associations were run by cohort and combined through fixed-and random-effects inverse variance weighted meta-analyses using the meta v4.16-1 R package 40 , and forest-plots were visually inspected. I 2 was used to evaluate heterogeneity in the results across cohorts.

Exposure-omics network analyses
We took the list of associations passing multiple testing correction (N=1,170) and conducted network analyses. We build a network for each exposure period (period-specific), and each network contained all the molecular layers (multi-omics). Molecular features and exposures were considered the nodes of the network and the edges were the associations between them.
Networks visualization was carried out using Cytoscape v3.9.0 (http://cytoscape.org). The two networks were automatically arranged using the Cytoscape force-directed layout which aims to highlight the underlying topology of the graph 41

Functional enrichment analyses
Functional enrichment analyses were restricted to molecular layers with features which could be easily annotated at the gene level: DNA methylation, gene and miRNA transcription, and proteins.
Enrichment was investigated using a variety of databases following the same workflow. We annotated molecular features to genes within each pregnancy and childhood cluster and performed functional enrichment analyses. Second, for pregnancy and childhood exposures with at least one significant association with molecular features after multiple-testing correction, we retrieved all molecular features associated at p-value <1E-03. After that, we annotated these features to genes as described in the omics data acquisition sections, and obtained a unique list of "dysregulated" genes by combining genes detected in any of the molecular layers. The gene universe was the list of genes in the methylation study, annotated with the IlluminaHumanMethylation450kanno.ilmn12.hg19 v0.6.0 (N=21,235).
To test associations between DNA methylation levels and gene expression levels in cis (cis-eQTMs), we paired each transcript cluster (TC) to all CpGs closer than 500 kb from its TSS, either upstream or downstream. For each CpG-TC pair we fitted a linear regression model between 13 gene expression and methylation levels adjusted for age, sex and cohort. Methylation slide batch effects were controlled using ComBat, while for gene expression we eliminated the effect of surrogate variables protecting covariates in the model (age, sex and cohort) as well as blood cell type proportions. The analysis was restricted to children of European ancestry and autosomal chromosomes. To ensure that CpGs paired to a higher number of TCs do not have higher chances of being part of an eQTM, multiple-testing was controlled at the CpG level. More details on the analyses and the multiple-testing correction can be found elsewhere 47 .